SQL Stress Testing.¶

By César Pérez

Introduction.¶

In this notebook I'll describe a test comparing the performance of two Python libraries for working with SQL databases: sqlite3 and pyodbc. For this exercise, I am going to focus on INSERT and SELECT statements. I've created two auxiliary scripts that incrementally add rows and columns to an SQLite database and an MSSQL database, capturing timestamps just before and after each task to measure the time taken to complete it. Both scripts create or update a CSV file with values such as: operation performed, rows added, existing rows, start / end times, and their difference. In this notebook, I won't directly describe the scripts I've used to test the performance, but they are available in the links below.
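
Although the scripts themselves aren't reproduced here, the timing capture works along these lines (a minimal sketch; the function name and the callable are hypothetical, while start_op, end_op and time_diff match the CSV columns shown later):

    from datetime import datetime

    def time_operation(run_op):
        # run_op is a hypothetical callable performing one INSERT or SELECT batch
        start_op = datetime.now()
        run_op()
        end_op = datetime.now()
        time_diff = (end_op - start_op).total_seconds()  # seconds, as in the CSV
        return start_op, end_op, time_diff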

Hardware / Software Specs.¶

  • Processor: Intel(R) Core(TM) i5-10300H CPU @ 2.50GHz

  • RAM: 16.0 GB (15.8 GB usable)

  • OS: Windows 10 Home x64

  • Python Libraries: sqlite3, pyodbc (4.0.35)

  • Python version: 3.9.13

Python 3 Scripts.¶

  • SQLite: testSQLite.py
  • MSSQL: testMSSQL.py

Limitations.¶

Even if we want to perform the same operation on both databases, the code cannot be exactly the same. For the SQLite script I've used the method executescript(), and for the MSSQL script I've used the method executemany(), as sketched below.
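
To make the difference concrete, here is a minimal sketch of the two insertion styles. The table, column, file name and connection string below are illustrative assumptions, not the actual test schema:

    import sqlite3
    import pyodbc

    # sqlite3: many statements bundled into one script string, run in a single call.
    con = sqlite3.connect('stress_test.db')  # hypothetical file name
    con.executescript("""
        CREATE TABLE IF NOT EXISTS t (c1 INTEGER);
        INSERT INTO t (c1) VALUES (1);
        INSERT INTO t (c1) VALUES (2);
    """)
    con.commit()
    con.close()

    # pyodbc: one parameterized statement, executed once per parameter tuple.
    conn_str = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=test;Trusted_Connection=yes'  # hypothetical
    conn = pyodbc.connect(conn_str)
    cur = conn.cursor()
    cur.executemany("INSERT INTO t (c1) VALUES (?)", [(1,), (2,)])
    conn.commit()
    conn.close()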

Performance Analysis.¶

To analyse the performance, I'm going to:

  1. Read the two CSV files created with the help of the auxiliary scripts mentioned earlier (one having the data from the SQLite database and the second one having the data from the MSSQL database) into pandas dataframes.
  2. Add an additional 'source' column to each dataframe (the values will be 'SQLite' or 'MSSQL').
  3. Concatenate both dataframes (excluding the 'file_size(KB)' column, which only exists in the SQLite data) so that both CSV sources are combined into a single dataframe.
  4. From the concatenated dataframe, create two tables, one for INSERT statements and the second for SELECT statements.
  5. Create visuals per source and SQL command.
In [1]:
# 1. Prepare the environment
import pandas as pd
import seaborn as sns
In [2]:
# 2. Read and prepare data
SQLite_data = pd.read_csv('report_SQLite.csv')
SQLite_data['source'] = 'SQLite'
SQLite_data.head()
Out[2]:
   cols  rows  inserted_rows                    start_op                      end_op  time_diff  operation  file_size(KB)  source
0     1   100            100  2022-07-31 18:03:56.346697  2022-07-31 18:04:00.048014   3.701317     INSERT          160.0  SQLite
1     1   100            100  2022-07-31 18:04:00.063387  2022-07-31 18:04:00.063387   0.000000     SELECT          160.0  SQLite
2     1   300            200  2022-07-31 18:04:00.063387  2022-07-31 18:04:07.452064   7.388677     INSERT          160.0  SQLite
3     1   300            200  2022-07-31 18:04:07.452064  2022-07-31 18:04:07.467377   0.015313     SELECT          160.0  SQLite
4     1   600            300  2022-07-31 18:04:07.467377  2022-07-31 18:04:18.789870  11.322493     INSERT          160.0  SQLite
In [3]:
MSSQL_data = pd.read_csv('report_MSSQL.csv')
MSSQL_data['source'] = 'MSSQL'
MSSQL_data.head()
Out[3]:
   cols  rows  inserted_rows                    start_op                      end_op  time_diff  operation  source
0     1   100            100  2023-02-03 07:42:02.832929  2023-02-03 07:42:02.838674   0.005745     INSERT   MSSQL
1     1   100            100  2023-02-03 07:42:02.853426  2023-02-03 07:42:02.853426   0.000000     SELECT   MSSQL
2     1   300            200  2023-02-03 07:42:02.855421  2023-02-03 07:42:02.865166   0.009745     INSERT   MSSQL
3     1   300            200  2023-02-03 07:42:02.866132  2023-02-03 07:42:02.867129   0.000997     SELECT   MSSQL
4     1   600            300  2023-02-03 07:42:02.868158  2023-02-03 07:42:02.883086   0.014928     INSERT   MSSQL
In [4]:
merged_df = pd.concat([SQLite_data.loc[:, ~SQLite_data.columns.isin(['file_size(KB)'])], MSSQL_data])
merged_df = merged_df.loc[merged_df['time_diff'] < 300]  # remove outliers: drop operations that took 300 seconds or more
merged_df_INSERT = merged_df.loc[merged_df['operation'] == 'INSERT']
merged_df_SELECT = merged_df.loc[merged_df['operation'] == 'SELECT']

INSERT Statement.¶

In [5]:
# 3. Results
# Plot the lines on two facets, source https://seaborn.pydata.org/examples/faceted_lineplot.html 
sns.relplot(
    data=merged_df_INSERT,
    x="inserted_rows", 
    y="time_diff", 
    col="source",
    kind="line", 
    hue="cols",
    height=5,
    aspect=.75, 
    facet_kws=dict(sharex=False), 
    alpha = 0.6
)
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x25d0a8632b0>

SELECT Statement.¶

In [6]:
sns.relplot(
    data=merged_df_SELECT,
    x="rows", 
    y="time_diff", 
    col="source",
    kind="line", 
    hue="cols",
    height=5, 
    aspect=.75, 
    facet_kws=dict(sharex=False), 
    alpha = 0.6
).set_xticklabels(['','0', '20K', '40K', '60K', '80K', '100K', '120K', ''],rotation=30)
Out[6]:
<seaborn.axisgrid.FacetGrid at 0x25d0a863e80>

Conclusions.¶

The results indicate that the MSSQL script, using pyodbc, performs considerably better than the SQLite one. It seems that using SQLite provides portability at the expense of performance.

References.¶

Pandas: McKinney, W., et al. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56).

Seaborn: Waskom, M., Botvinnik, O., Kane, D., Hobson, P., Lukauskas, S., Gemperline, D. C., et al. (2017). mwaskom/seaborn: v0.8.1 (September 2017). Zenodo. https://doi.org/10.5281/zenodo.883859